@wking wking commented May 21, 2021

Consider these cases:

a. Component A is in a state that allows updates, and nothing in the rest of the cluster would break if A updated.
b. Component A is in a state that allows updates, but component B (which is in-cluster, but not part of A) would break if A updated.
c. Component A would break if it updated.

Operator A should pretty clearly be Upgradeable=True for (a) and Upgradeable=False for (c).

Before this commit, a narrow reading of the Upgradeable godoc comment would have operator A be Upgradeable=True for (b). This commit moves it to Upgradeable=False, based on discussion in openshift/enhancements#762, where it becomes the job of the API-server to set Upgradeable=False if updating the API-server would break nodes running old kubelets. The API-server can say "to unblock minor updates, update your kubelets". The machine-config operator will simultaneously say "hey, your kubelets are old, and here's how to update: $STEPS", but it won't use Upgradeable=False to say that (because the machine-config operator would be happy to have its component nodes updated).
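
A minimal sketch of how the three cases map to the condition after this change. The `ClusterState` and `Condition` types and the `upgradeable` function are hypothetical names made up for illustration; they are not the openshift/api types, and a real operator would have to derive these inputs from whatever it can actually observe.

```go
package main

import (
	"fmt"
	"strings"
)

// ClusterState is a hypothetical summary of what operator A can observe.
type ClusterState struct {
	// SelfBlocked is case (c): component A would break if it updated.
	SelfBlocked bool
	// DependentsAtRisk is case (b): in-cluster components outside A
	// that would break if A updated.
	DependentsAtRisk []string
}

// Condition is a simplified stand-in for an operator status condition.
type Condition struct {
	Type    string
	Status  string
	Message string
}

// upgradeable maps the observed state to an Upgradeable condition.
// Case (a) is the default: nothing is known to break, so Status stays "True".
func upgradeable(s ClusterState) Condition {
	c := Condition{Type: "Upgradeable", Status: "True"}
	switch {
	case s.SelfBlocked:
		c.Status = "False"
		c.Message = "component A would break if it updated"
	case len(s.DependentsAtRisk) > 0:
		c.Status = "False"
		c.Message = "updating A would break: " + strings.Join(s.DependentsAtRisk, ", ")
	}
	return c
}

func main() {
	// Case (b), using the kubelet-skew example from the discussion.
	c := upgradeable(ClusterState{DependentsAtRisk: []string{"nodes running old kubelets"}})
	fmt.Printf("%s=%s: %s\n", c.Type, c.Status, c.Message)
}
```

The only behavioral difference from the narrow reading is the second `case`: a non-empty `DependentsAtRisk` now yields Upgradeable=False instead of falling through to True.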

As pointed out in discussion in openshift/enhancements#762, this is a bit of a bottomless pit. For example, component A may be removing a deprecated feature on update, and there may be user workloads that depend on that feature but hardly ever use it. Component A might reasonably think "nobody has used $OUTGOING_FEATURE in the last week, so I'm Upgradeable=True", and then post-update, the user workload would go to hit the removed API and break. And obviously in-cluster components will have even less visibility into any out-of-cluster components that depend on them. So using Upgradeable=False to protect other components from breaking is going to be a best-effort sort of thing. But this commit pivots so that it's clearer that we'll put that effort in when we can.
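
To make the best-effort caveat concrete, here is a small sketch of the "nobody has used it in the last week" heuristic. The `looksUnused` function and the `day`/`week` constants are hypothetical; how an operator would actually measure feature usage is not something this PR defines.

```go
package main

import (
	"fmt"
	"time"
)

const (
	day  = 24 * time.Hour
	week = 7 * day
)

// looksUnused reports whether the outgoing feature has gone unobserved
// for longer than the observation window. This is a hypothetical
// heuristic, not an API from any OpenShift operator.
func looksUnused(sinceLastUse, window time.Duration) bool {
	return sinceLastUse > window
}

func main() {
	// A workload that last touched the feature ten days ago slips past a
	// one-week window, so the operator would report Upgradeable=True even
	// though that workload breaks the next time it calls the removed API.
	fmt.Println("Upgradeable:", looksUnused(10*day, week))
}
```

Widening the window narrows the gap but never closes it, which is why the guard can only ever be best-effort.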

…r scope

@wking wking force-pushed the expand-upgradeable-scope-to-cluster-state branch from d7dc014 to 1a82848 Compare May 21, 2021 17:52
@openshift-ci openshift-ci bot requested review from mfojtik and soltysh May 21, 2021 17:52

wking commented May 21, 2021

I'm fuzzy on this comment about the API-server being a client of the kubelet for exec flows. Perhaps that is sufficient to get the kubelet-skew-guard in under (c)? And maybe we want the godocs to be generic enough that operator B could say "hey, A is going too fast, wait for me to catch up" in cases where A isn't smart enough to notice B falling behind? Wording would be something like:

Upgradeable indicates whether the operator considers the OpenShift core safe to upgrade based on the current cluster state.

@soltysh soltysh left a comment

/lgtm
/approve

@openshift-ci openshift-ci bot added the lgtm Indicates that a PR is ready to be merged. label Jun 2, 2021

openshift-ci bot commented Jun 2, 2021

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: soltysh, wking

@openshift-ci openshift-ci bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Jun 2, 2021
@openshift-merge-robot openshift-merge-robot merged commit 2deea64 into openshift:master Jun 14, 2021